FastQ Sequences

Sequences returned from the Illumina sequences machines are often stored in FASTQ format.

igv


Aligned Sequences

Aligned sequence reads are stored in BAM format.

igv


From unaligned sequence reads to aligned data,

To convert our unaligned sequence reads to aligned sequence reads we will need to use an aligner sotware to assign reads onto a reference genome.

We have already reviewed the FASTA format which is used to store reference genome

igv


From unaligned sequence reads to aligned data,

We have also reviewed reference genomes in Bioconductor stored in the BSgenome packages.

igv


Alignment software

In previous sessions we have seen how to handle unaligned sequence data as well as where and how to retrieve genome reference data.

We now need a suitable aligner software to place unaligned reads onto our reference genome.

Aligner softwares can be broadly placed into two categories.

  • Genomic aligner (WGS, ChIP-seq, ATAC-seq etc).
  • Splice Aware aligner (RNA-seq, Ribo-seq)

Popular genomic aligners include Bowtie, bwa, subread,GMAP.

Popular splice aware aligners include Tophat, SpliceMap, subjunc, GSNAP, STAR.


Splice alignment

A splice aware aligner is important for analysis of RNAseq where mRNA’s introns are spliced to stitch exons into continuous sequence.

igv


Alignment software in R

A few of the popular aligners are wrapped up in R/Bioconductor packages allowing us to use our aligner software from R as well as make use of some of the R/Bioconductor objects we are growing to love.

  • Bowtie - rbowtie, rbowtie2, QuasR
  • GSNAP/GMAP - gmapR
  • SpliceMap - QuasR
  • subread/subjunc - rsubread

Alignment software in R

The rsubread and gampR packages offer convenient access to subread/subjunc and gmap/gsnap on Mac and Linux but sadly are not implemented on Windows.

The QuasR package offers an interface to Bowtie and SpliceMap on Windows, Mac and Linux and so provides access to a genomic and splice aware aligner on all systems.

In this session we will focus on the QuasR package.


Data

In this session we will be making use of some public datasets from the Encode consortium.

We will be using raw sequence reads in fastQ format which have been generated from an RNAseq experiment.

This RNAseq data has been generated from the human cell line GM12878 and the link to experiment can be found here or a direct link to FastQ for replicate 2 we are using can be found here.


Data

To speed up the processing for this practical I have retrieved aligned data from Encode for the sample ENCSR297UBP and extracted reads mapping to ActB gene on hg20 human genome build. I have further downsampled these reads to only 10000 reads out of the 200000 mapping to this gene.

This sampled data can be found in Data/sampledActin.fq.gz

BSGenome packages.

We will also require some reference data so lets install the BSgenome package for hg38

source("https://bioconductor.org/biocLite.R")
biocLite("BSgenome.Hsapiens.UCSC.hg38")
library(BSgenome.Hsapiens.UCSC.hg38)

QuasR genomic alignment - Sample table

The sample table requires is a tab-delimited file listing the path to fastq(s) to be aligned and the desired sample names.

FileName <- "../../Data/sampledActin.fq.gz"
SampleName <- "sampledActin"
sampleTable <- data.frame(FileName,SampleName)
write.table(sampleTable,file="sampleTable.txt",sep="\t",quote=FALSE,row.names = FALSE)
##                        FileName   SampleName
## 1 ../../Data/sampledActin.fq.gz sampledActin

QuasR genomic with FASTA file

Internally, QuasR will create a FASTA file from our BSgenome object prior to alignment.

We can also provide a FASTA file directly to the qAlign() function.

First lets extract a FASTA file from out BSgenome object. Here we will create a FASTA file for just Chr7 (the location of ActB)

library(BSgenome.Hsapiens.UCSC.hg38)
chr7hg38 <- BSgenome.Hsapiens.UCSC.hg38[["chr7"]]
chr7hg38Set <- DNAStringSet(list(chr7=chr7hg38))
writeXStringSet(chr7hg38Set,file="chr7.fa")

Aligned data in BAM

The qAlign() function will allow us to retrieve the aligned data as BAM.

Our generated BAM file will be in same directory as our fastq file. By default QuasR will add a random string to ensure we do not write over previous files.

Data/sampledActin_29a3a870bf.bam

Rsamtools

The samtools software provide command line tools to handle SAM and BAM files such as indexing and sorting.

The Rsamtools package allows us to make us of the samtools functions within R.

First we can install and load the library.

source("https://bioconductor.org/biocLite.R")
biocLite("Rsamtools")
library(Rsamtools)

Rsamtools sorting

After sorting, we can now index our sorted BAM file using the indexBAM() function.

indexBam("SortedActB.bam")
##       SortedActB.bam 
## "SortedActB.bam.bai"

Review in IGV

We can now review our BAM file in IGV.

igv

QuasR splice aware with FASTA file

We are missing reads which would align across splice junctions. To use the qAlign() function to align spliced reads we must simply specify the parameter splicedAlignment to be TRUE.

qAlign("sampleTable.txt","chr7.fa",splicedAlignment = TRUE)

Rsamtools BAM overview

We can get an overview of BAM file information using the quickBamFlagSummary() function.

quickBamFlagSummary("SortedActBSpliced.bam")
##                                 group |    nb of |    nb of | mean / max
##                                    of |  records |   unique | records per
##                               records | in group |   QNAMEs | unique QNAME
## All records........................ A |    10000 |    10000 |    1 / 1
##   o template has single segment.... S |    10000 |    10000 |    1 / 1
##   o template has multiple segments. M |        0 |        0 |   NA / NA
##       - first segment.............. F |        0 |        0 |   NA / NA
##       - last segment............... L |        0 |        0 |   NA / NA
##       - other segment.............. O |        0 |        0 |   NA / NA
## 
## Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
## Indentation reflects this.
## 
## Details for group S:
##   o record is mapped.............. S1 |     9037 |     9037 |    1 / 1
##       - primary alignment......... S2 |     9037 |     9037 |    1 / 1
##       - secondary alignment....... S3 |        0 |        0 |   NA / NA
##   o record is unmapped............ S4 |      963 |      963 |    1 / 1

Review in IGV

Now if we compare to the original alignment from Encode we can identify where some of our unaligned reads may have gone.

igv

The Rsubread package

The Rsubread package offers a fast aligner for both genomic and splice aware alignment.

The Rsubread package is only available on Mac and Linux systems.

library(Rsubread)

The Rsubread package

We can now use the subjunc() function to align reads in a splice aware manner.

We must specify out index name, fastq file, desired output format and name of BAM file to be created.

subjunc("chr7","../../Data/sampledActin.fq.gz",
        output_format = "BAM",
        output_file = "../../Data/RsubreadsampledActin.bam")

The Rsubread package

Finally we can assess mapping rate.

quickBamFlagSummary("Sorted_RsubreadsampledActin.bam")
##                                 group |    nb of |    nb of | mean / max
##                                    of |  records |   unique | records per
##                               records | in group |   QNAMEs | unique QNAME
## All records........................ A |    10000 |    10000 |    1 / 1
##   o template has single segment.... S |    10000 |    10000 |    1 / 1
##   o template has multiple segments. M |        0 |        0 |   NA / NA
##       - first segment.............. F |        0 |        0 |   NA / NA
##       - last segment............... L |        0 |        0 |   NA / NA
##       - other segment.............. O |        0 |        0 |   NA / NA
## 
## Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
## Indentation reflects this.
## 
## Details for group S:
##   o record is mapped.............. S1 |     9720 |     9720 |    1 / 1
##       - primary alignment......... S2 |     9720 |     9720 |    1 / 1
##       - secondary alignment....... S3 |        0 |        0 |   NA / NA
##   o record is unmapped............ S4 |      280 |      280 |    1 / 1

Time for an exercise.

Link_to_exercises

Link_to_answers